# Open-vocabulary recognition

| Model | License | Description | Task | Author | Downloads | Likes |
|---|---|---|---|---|---|---|
| OPENCLIP SigLIP Tiny 14 Distill SigLIP 400m Cc9m | MIT | Lightweight SigLIP-based vision-language model distilled from the larger SigLIP-400m model; suited to zero-shot image classification. | Image Classification | PumeTu | 30 | 0 |
| Llmdet Swin Tiny Hf | Apache-2.0 | LLMDet is an open-vocabulary object detector supervised by a large language model, capable of zero-shot object detection. | Object Detection | fushh7 | 2,451 | 0 |
| Eva02 Large Patch14 Clip 224.merged2b | MIT | EVA-CLIP vision-language model distributed as OpenCLIP/timm weights, supporting zero-shot image classification. | Image Classification | timm | 165 | 0 |
| Eva02 Enormous Patch14 Clip 224.laion2b Plus | MIT | EVA-CLIP, a large-scale CLIP-style vision-language model, supporting zero-shot image classification. | Image Classification | timm | 54 | 0 |
| Vit Huge Patch14 Clip 224.metaclip Altogether | | CLIP model built on the ViT-Huge architecture, supporting zero-shot image classification. | Image Classification | timm | 171 | 1 |
| Resnet101 Clip.openai | MIT | CLIP model with a ResNet-101 image encoder, supporting zero-shot image classification. | Image Classification | timm | 2,717 | 0 |
| Owlv2 Large Patch14 Ensemble | Apache-2.0 | OWLv2 zero-shot, text-conditioned object detector that finds objects in images from text queries. | Object Detection Transformers | Thomasboosinger | 1 | 0 |
| Owlv2 Base Patch16 | Apache-2.0 | OWLv2 zero-shot, text-conditioned object detector that detects and localizes objects in images from text queries. | Object Detection Transformers | vvmnnnkv | 26 | 0 |
| Owlv2 Large Patch14 Finetuned | Apache-2.0 | OWLv2 zero-shot, text-conditioned object detector that detects objects from text queries without category-specific training data. | Object Detection Transformers | google | 1,434 | 4 |
| Owlv2 Base Patch16 Finetuned | Apache-2.0 | OWLv2 zero-shot, text-conditioned object detector that retrieves objects in images from text queries. | Object Detection Transformers | google | 2,698 | 3 |
| CLIP ViT L 14 CommonPool.XL.clip S13b B90k | MIT | CLIP-architecture vision-language model supporting zero-shot image classification and cross-modal retrieval. | Image Classification | laion | 534 | 1 |
| CLIP ViT B 32 CommonPool.M.clip S128m B4k | MIT | CLIP model trained on the CommonPool.M data pool, supporting zero-shot image classification. | Image Classification | laion | 164 | 0 |
| Eva02 Large Patch14 Clip 224.merged2b S4b B131k | MIT | EVA02, a large-scale CLIP-style vision-language model, supporting zero-shot image classification. | Image Classification | timm | 5,696 | 6 |
| Owlvit Base Patch32 | Apache-2.0 | OWL-ViT zero-shot, text-conditioned object detector that can search for objects in images via text queries without category-specific training data. | Object Detection Transformers | google | 764.95k | 129 |
| Clip Vit Base Patch32 | | CLIP is a multimodal model from OpenAI that relates images and text, supporting zero-shot image classification. | Image Classification | openai | 14.0M | 666 |
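Most of the CLIP, SigLIP, and EVA-CLIP checkpoints above are used the same way for zero-shot classification: encode the image and a set of candidate label prompts, then rank the labels by image-text similarity. Below is a minimal sketch using the `openai/clip-vit-base-patch32` checkpoint (the "Clip Vit Base Patch32" entry) with the Hugging Face `transformers` API; the image path and label prompts are placeholders.

```python
# Zero-shot image classification sketch with openai/clip-vit-base-patch32.
# The image path and candidate labels are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```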
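The OWL-ViT and OWLv2 entries are text-conditioned detectors rather than classifiers: given free-form text queries, they return scored bounding boxes. A minimal sketch using the `google/owlvit-base-patch32` checkpoint (the "Owlvit Base Patch32" entry) through the `transformers` zero-shot object detection pipeline; the image, queries, and score threshold are placeholders.

```python
# Zero-shot object detection sketch with google/owlvit-base-patch32.
# The image path, text queries, and threshold are illustrative placeholders.
from PIL import Image
from transformers import pipeline

detector = pipeline("zero-shot-object-detection",
                    model="google/owlvit-base-patch32")

image = Image.open("example.jpg").convert("RGB")
queries = ["a cat", "a remote control"]

results = detector(image, candidate_labels=queries, threshold=0.1)
for r in results:
    box = r["box"]  # dict with xmin, ymin, xmax, ymax in pixels
    print(f"{r['label']}: score={r['score']:.2f}, box={box}")
```

The OWLv2 checkpoints in the table should work the same way by swapping in their model ids, since they expose the same zero-shot detection interface.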